

Language models enable zero-shot prediction of the effects of mutations on protein function

Neural Information Processing Systems

Modeling the effect of sequence variation on function is a fundamental problem for understanding and designing proteins. Since evolution encodes information about function into patterns in protein sequences, unsupervised models of variant effects can be learned from sequence data. The approach to date has been to fit a model to a family of related sequences. The conventional setting is limited, since a new model must be trained for each prediction task. We show that using only zero-shot inference, without any supervision from experimental data or additional training, protein language models capture the functional effects of sequence variation, performing at the state of the art.
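The zero-shot scoring rule this abstract describes can be sketched as a log-ratio: the model's probability of the mutant residue versus the wild-type residue at the masked position. The `toy_masked_distribution` below is a hypothetical stand-in for a real protein language model, used only to make the scoring rule concrete:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_masked_distribution(sequence, position):
    # Hypothetical stand-in for a masked protein language model: it puts
    # half the probability mass on the wild-type residue and splits the
    # rest evenly. A real model would condition on the full sequence.
    wt = sequence[position]
    probs = {aa: 0.5 / (len(AMINO_ACIDS) - 1) for aa in AMINO_ACIDS}
    probs[wt] = 0.5
    return probs

def mutation_score(sequence, position, mutant_aa):
    """Zero-shot variant score: log p(mutant) - log p(wild-type) at the
    masked site. Negative values suggest the model disfavours the change."""
    probs = toy_masked_distribution(sequence, position)
    wt = sequence[position]
    return math.log(probs[mutant_aa]) - math.log(probs[wt])

score = mutation_score("MKTAYIAKQR", 2, "A")  # score the T->A substitution
```

With a real model the same rule applies unchanged; only the distribution comes from the network instead of the toy function.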


Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms

Liu, Shiyi, Liang, Buwen, Fang, Yuetong, Jiang, Zixuan, Xu, Renjing

arXiv.org Artificial Intelligence

Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein-Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein-protein interaction (PPI) prediction, where protein sequences and their hierarchical attributes are aligned through multi-tiered biological representation matching. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. The framework adaptively incorporates domain and family knowledge through a data-driven penalty mechanism, enforcing consistency between the learned embedding space and the intrinsic hierarchy of protein functions. Experiments on benchmark datasets demonstrate that HIPPO achieves state-of-the-art performance, outperforming existing methods and showing robustness in low-data regimes. Notably, the model demonstrates strong zero-shot transferability to other species without retraining, enabling reliable PPI prediction and functional inference even in less characterized or rare organisms where experimental data are limited. Further analysis reveals that hierarchical feature fusion is critical for capturing conserved interaction determinants, such as binding motifs and functional annotations. This work advances cross-species PPI prediction and provides a unified framework for interaction prediction in scenarios with sparse or imbalanced multi-species data.
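The hierarchical contrastive idea can be sketched as an InfoNCE-style loss in which negatives are weighted by their hierarchical distance to the anchor. The `1 + level` weighting below is an illustrative assumption, not HIPPO's actual penalty mechanism:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def hierarchical_nce(anchor, positive, negatives, tau=0.1):
    """negatives: list of (embedding, level) pairs, where `level` is the
    hierarchical distance to the anchor (e.g. 1 = same family, 3 = unrelated).
    Each negative is weighted 1 + level, so proteins that are hierarchically
    distant are penalised more heavily for being close in embedding space."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum((1 + level) * math.exp(cosine(anchor, emb) / tau)
              for emb, level in negatives)
    return -math.log(pos / (pos + neg))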


Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure

Wang, Zhicong, Ma, Zicheng, Cao, Ziqiang, Zhou, Changlong, Zhang, Jun, Gao, Yiqin

arXiv.org Artificial Intelligence

Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein Q&A systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model incorporates a modified ProteinMPNN encoder, which encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and employ LoRA techniques for the decoder. We conducted experiments on two datasets; both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.
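The adapter's core operation, text tokens attending over per-residue protein embeddings, can be sketched as single-head cross-attention. This projection-free version is a schematic assumption, not Prot2Chat's actual module:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(text_queries, protein_keys, protein_values):
    """Each text-token query attends over per-residue protein embeddings
    and returns one fused vector per text token. Single-head, with no
    learned projections -- a schematic of the adapter's core step only."""
    d = len(protein_keys[0])
    fused = []
    for q in text_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in protein_keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, protein_values))
                      for j in range(len(protein_values[0]))])
    return fused
```

In the full framework the fused vectors would be fed to the (LoRA-tuned) decoder; here they simply show how text queries pool structural information from the residue embeddings.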


Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs

Chatzianastasis, Michail, Dasoulas, George, Vazirgiannis, Michalis

arXiv.org Artificial Intelligence

Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in applying these methods to 3D protein structures. In this work, we propose a pre-training scheme that goes beyond trivial masking by leveraging the 3D and hierarchical structure of proteins. We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures, by predicting the distances between local geometric centroids of protein subgraphs and the global geometric centroid of the protein. The motivation for this method is twofold. First, the relative spatial arrangements and geometric relationships among different regions of a protein are crucial for its function. Moreover, proteins are often organized in a hierarchical manner, where smaller substructures, such as secondary structure elements, assemble into larger domains. By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization. We experimentally show that our proposed pretraining strategy leads to significant improvements in the performance of 3D GNNs in various protein classification tasks.
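The pretraining targets described above can be sketched directly: for each subgraph, the Euclidean distance from its local geometric centroid to the protein's global centroid. The function and subgraph format below are illustrative; in the paper's setting a GNN would be trained to predict these distances from the 3D graph:

```python
import math

def centroid(coords):
    # Geometric centroid of a list of 3D points.
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def subgraph_centroid_targets(residue_coords, subgraphs):
    """residue_coords: one (x, y, z) tuple per residue; subgraphs: lists of
    residue indices. Returns, per subgraph, the distance between its local
    centroid and the protein's global centroid -- the self-supervised
    regression target sketched here."""
    global_c = centroid(residue_coords)
    targets = []
    for indices in subgraphs:
        local_c = centroid([residue_coords[i] for i in indices])
        targets.append(math.dist(local_c, global_c))
    return targets
```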


Scientific Large Language Models: A Survey on Biological & Chemical Domains

Zhang, Qiang, Ding, Keyang, Lyv, Tianwen, Wang, Xinda, Yin, Qingyu, Zhang, Yiwen, Yu, Jing, Wang, Yuhao, Li, Xiaotong, Xiang, Zhuoyi, Zhuang, Xiang, Wang, Zeyuan, Qin, Ming, Zhang, Mengyao, Zhang, Jinlu, Cui, Jiyu, Xu, Renjun, Chen, Hongyang, Fan, Xiaohui, Xing, Huabin, Chen, Huajun

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.


Predicting Protein Functions using Machine Learning

#artificialintelligence

The CAFA competitions use the Gene Ontology (GO) as the mechanism to annotate proteins. The ontology is a controlled vocabulary used to precisely define each term in question. The GO, in particular, defines terms for three different protein-related aspects: molecular function, biological process, and cellular component. The GO is an "always evolving" set, as new terms are incorporated, refined, or made obsolete. The terms in the GO are organized in a hierarchical structure (a directed acyclic graph) where each of them is given a unique identifier.
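The hierarchical nature of GO annotation can be sketched with a tiny parent map: annotating a protein with a term implicitly annotates it with all ancestors of that term (the true-path rule used when CAFA predictions are propagated). The map below uses a few real molecular-function identifiers but is far smaller than the real ontology:

```python
# Minimal GO-style parent map: child term -> list of parent terms.
go_parents = {
    "GO:0004672": ["GO:0016301"],  # protein kinase activity is_a kinase activity
    "GO:0016301": ["GO:0003824"],  # kinase activity is_a catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity is_a molecular_function
    "GO:0003674": [],              # root of the molecular-function aspect
}

def propagate(term, parents=go_parents):
    """Return the term plus all of its ancestors: the full set of
    annotations implied by assigning `term` to a protein."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen
```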


A Review of Deep Learning Techniques for Protein Function Prediction

Aggarwal, Divyanshu, Hasija, Yasha

arXiv.org Artificial Intelligence

Deep learning and big data have shown tremendous success in bioinformatics and computational biology in recent years, and artificial intelligence methods have also contributed significantly to the task of protein function classification. This review analyzes recent developments in approaches to predicting protein function using deep learning. We explain the importance of determining protein function and why automating this task is crucial. After reviewing the deep learning techniques widely used for this task, we highlight the emergence of modern state-of-the-art (SOTA) deep learning models, which have achieved groundbreaking results in computer vision, natural language processing, and multi-modal learning in the last few years. We hope this review provides a broad view of the current role and advances of deep learning in the biological sciences, especially in protein function prediction, and encourages new researchers to contribute to this area.


Machine learning helps predict protein functions

#artificialintelligence

To engineer proteins for specific functions, scientists change a protein sequence and experimentally test how that change alters its function. Because there are too many possible amino acid sequence changes to test them all in the laboratory, researchers build computational models that predict protein function based on amino acid sequences. Scientists have now combined multiple machine learning approaches for building a simple predictive model that often works better than established, complex methods.
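The combination idea mentioned above can be sketched as a plain prediction average over several simple models; the stand-in predictors below are hypothetical placeholders for fitted regressors:

```python
def ensemble_predict(models, sequence):
    """Average the predictions of several regressors, each mapping a
    protein sequence to a predicted activity. Real components might be a
    ridge regression, a k-NN model, and a position-weight baseline."""
    preds = [model(sequence) for model in models]
    return sum(preds) / len(preds)

# Hypothetical stand-in predictors returning fixed activities:
stand_ins = [lambda seq: 0.8, lambda seq: 1.0, lambda seq: 1.2]
```

Averaging is the simplest way to combine approaches; disagreement among the component predictions can also flag sequences worth testing in the laboratory.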


Leveraging Sequence Embedding and Convolutional Neural Network for Protein Function Prediction

Tseng, Wei-Cheng, Chi, Po-Han, Wu, Jia-Hua, Sun, Min

arXiv.org Artificial Intelligence

The capability to accurately predict protein functions and properties is essential in the biotechnology industry, e.g. in drug development and artificial protein synthesis. The main challenges of protein function prediction are the large label space and the lack of labeled training data. Our method leverages unsupervised sequence embedding and the success of deep convolutional neural networks to overcome these challenges. In contrast, most existing methods shrink the label space by discarding rare protein functions. Furthermore, some existing methods require additional bio-information (e.g., the 3-dimensional structure of the proteins), which is difficult to determine through biochemical experiments. Our proposed method significantly outperforms the other methods on the publicly available benchmark using only protein sequences as input, speeding up the process of identifying protein functions.
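The convolutional step such models rely on can be sketched as a single 1D filter sliding over per-residue embeddings, followed by global max pooling; the toy embeddings and kernel in the test are illustrative, since a real model learns both:

```python
def conv1d_maxpool(embeddings, kernel):
    """embeddings: list of per-residue vectors; kernel: list of vectors of
    the same width spanning len(kernel) consecutive residues. Slides the
    filter over every window, then applies global max pooling so the
    result is position-independent -- the core of CNN-based protein
    function predictors."""
    k = len(kernel)
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        activations.append(sum(sum(w * x for w, x in zip(kw, ew))
                               for kw, ew in zip(kernel, window)))
    return max(activations)  # global max pooling over positions
```

A full predictor would stack many such filters and feed the pooled activations to a classifier over the (large) function label space.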